Full-Text Search & Elasticsearch
From a simple ILIKE query that suffocates under millions of rows, to relevance-ranked millisecond search — this chapter explains why tools like Elasticsearch exist, how the inverted index works, what BM25 does, and when to reach for Postgres FTS versus Elasticsearch in a real backend.
The Origin Story — Why Search Got Hard
Imagine it's 2005. You're a backend engineer at a fast-growing e-commerce startup. The product catalog has ~5,000 items. You write a search endpoint using a classic SQL pattern:
SQL-- Works great at 5 000 rows SELECT * FROM products WHERE name ILIKE '%laptop%' OR description ILIKE '%laptop%';
This returns results in roughly 50 ms. Users are happy, managers are happy, life is simple.
Fast-forward two years. The catalog explodes to millions of products. The same query now takes 30 seconds. The company is losing sales. On top of the speed issue, new requirements arrive:
- Relevance — searching "laptop" should show a MacBook Pro before a laptop bag.
- Typo tolerance — users type lapto during a sale rush; results must still appear.
- Speed — search must return in milliseconds, not seconds.
The Librarian Analogy
Think of your relational database (Postgres, MySQL, etc.) as a librarian in a giant library. The librarian knows exactly where every book lives on the shelf.
But it has one fatal flaw: to find books about a topic, it must physically walk to every single shelf, pull out every book, and read it page by page looking for your keyword.
The two problems summarised:
- Speed: A table with 50 million rows means 50 million row comparisons. On spinning disk this was catastrophic even in the SSD era it wastes enormous I/O.
- No relevance: A book whose title is "Introduction to Machine Learning" ranks identically to a book that merely mentions "machine" once in the appendix. The database has zero concept of importance.
Why ILIKE '%term%' Cannot Use Indexes
Postgres B-tree indexes work by sorting values. A B-tree can locate name LIKE 'lapt%' (prefix match) efficiently because sorted strings share prefixes. But ILIKE '%laptop%' has a leading wildcard — there is no useful prefix to sort on. The engine must fall back to a sequential scan.
% in a LIKE/ILIKE expression disables B-tree index usage. Postgres will read every single page of the heap.What the query plan looks like
SQL — EXPLAIN ANALYZEEXPLAIN ANALYZE SELECT id, name FROM products WHERE name ILIKE '%laptop%'; -- Output (simplified) -- Seq Scan on products (cost=0.00..18450.00 rows=5 width=36) -- Filter: ((name)::text ~~* '%laptop%'::text) -- Rows Removed by Filter: 4 999 995 -- Planning Time: 0.2 ms -- Execution Time: 28 940.7 ms ← 29 seconds!
Notice Seq Scan — every row examined, 4,999,995 rows thrown away, execution time ~29 s.
The Inverted Index — The Core Invention
"Instead of going through every document to find the term, maintain a map from every term to the documents that contain it."
This insight — inverting the search — is what the name captures. A normal index maps document → words. An inverted index maps word → documents.
How It Is Built
When a document is first stored (or updated), the search engine runs it through an analysis pipeline:
- Tokenisation — break text into individual tokens (words). "Introduction to Machine Learning" →
["introduction", "to", "machine", "learning"] - Normalisation / Lowercasing — convert to lowercase so "Machine" and "machine" are the same token.
- Stop-word removal — drop common words with no discriminative value ("to", "the", "a"). (Optional, configurable.)
- Stemming / Lemmatisation — reduce words to their root: "running" → "run", "searches" → "search". So a query for "searching" also matches "searched".
- Index entry creation — for each resulting term, record the document ID and position.
Elasticsearch — What It Is & How It Works
Elasticsearch is a distributed, JSON-document search engine built on top of Apache Lucene — the battle-tested Java library that implements the inverted index and BM25 scoring. Lucene has been around since 1999; Elasticsearch (2010) wraps it in a REST API and adds distributed clustering, replication, and a rich query DSL.
Key characteristics:
- Schema-less JSON documents — like MongoDB, you store JSON. Fields are auto-detected (or explicitly mapped).
- Near real-time — newly indexed documents are searchable within ~1 second (configurable).
- Horizontal scaling — data is split into shards (sub-indexes), each shard is a complete Lucene index. More nodes = more throughput.
- Powerful query DSL — JSON-based. Supports full-text, fuzzy, range, geo-distance, aggregations, etc.
Creating an Index with Field Mapping
HTTP / Elasticsearch DSLPUT /products { "mappings": { "properties": { "name": { "type": "text", // analysed — tokenised, stemmed "boost": 3 // title matches are 3× more relevant }, "description": { "type": "text", "boost": 1.5 }, "category": { "type": "keyword" // not analysed — exact match only }, "price": { "type": "float" } } } }
The difference between text and keyword is critical:
| Mapping Type | Analysed? | Use For |
|---|---|---|
text | Yes — tokenised, stemmed | Product names, descriptions, reviews |
keyword | No — stored as-is | Status codes, tags, category IDs |
A Full-Text Search Query
HTTP / Elasticsearch DSLGET /products/_search { "query": { "multi_match": { "query": "laptop", "fields": ["name^3", "description^1.5", "category"], "fuzziness": "AUTO" // typo tolerance } }, "size": 10 }
^3 is field boosting — matches in name contribute 3× as much to the relevance score as matches in the default field weight.
BM25 — How Relevance Scoring Works
BM25 (Best Match 25) is the default ranking function in Elasticsearch (and also in Postgres FTS). It produces a floating-point score per document; higher = more relevant. The full formula has several components — but conceptually four factors drive the score:
| Factor | What It Measures | Effect on Score |
|---|---|---|
| Term Frequency | Count of query term in document | Higher = more relevant (with diminishing returns) |
| Inverse Doc Frequency | Rarity of term across index | Rare terms carry more signal than common ones |
| Document Length | Length vs. avg document length | Short docs with the term rank higher than long docs |
| Field Boost | Which field the term appears in | Title match > body match (configurable multiplier) |
Why IDF Matters
Consider the word "the" — it appears in virtually every English document, so it has very high document frequency and a very low IDF weight. It contributes almost nothing to scores. A rare word like "photovoltaic" or "elasticsearch" appearing in a document is a strong signal of topical relevance.
Typo Tolerance — Fuzzy Search
Elasticsearch uses Levenshtein edit distance under the hood for fuzzy matching. Edit distance is the minimum number of single-character operations (insert, delete, substitute) needed to transform one word into another.
lapto→laptop: 1 insertion → edit distance 1 ✓treading→trending: 1 substitution → edit distance 1 ✓mcbook→macbook: 1 insertion → edit distance 1 ✓
Elasticsearch DSL — fuzzy query{ "query": { "fuzzy": { "name": { "value": "lapto", "fuzziness": "AUTO", // 0 for 1-2 chars, 1 for 3-5, 2 for 6+ "prefix_length": 1 // first N chars must match exactly } } } }
"fuzziness": "AUTO" is the recommended setting. It automatically adjusts the allowed edit distance based on word length. prefix_length: 1 prevents matching completely unrelated words that happen to be 1–2 edits away.
The ELK Stack — Logs & Observability
Elasticsearch isn't only for product search. It's the "E" in the famous ELK Stack used for centralised log management and observability.
If your company already runs ELK for logging, adding product/content search on top of the same Elasticsearch cluster is a natural, cost-effective choice — you're already paying for the infrastructure and your ops team already knows it.
Postgres Full-Text Search in Go
Postgres has had native full-text search since version 8.3 via the tsvector / tsquery types. It uses its own inverted index (the GIN index) and a BM25-variant scorer called ts_rank. It's not as feature-rich as Elasticsearch but perfectly adequate for moderate search needs on data already in Postgres.
DDL — Add a GIN Index
SQL-- Add a generated tsvector column and index it ALTER TABLE products ADD COLUMN search_vec tsvector GENERATED ALWAYS AS ( setweight(to_tsvector('english', coalesce(name, '')), 'A') || setweight(to_tsvector('english', coalesce(description, '')), 'B') ) STORED; CREATE INDEX idx_products_fts ON products USING GIN(search_vec);
setweight('A') means title matches outrank description matches — the Postgres equivalent of field boosting.
Go — Full-Text Search Handler
Gopackage main import ( "context" "fmt" "log" "net/http" "github.com/jackc/pgx/v5/pgxpool" ) type Product struct { ID int Name string Description string Rank float64 } func searchHandler(pool *pgxpool.Pool) http.HandlerFunc { return func(w http.ResponseWriter, r *http.Request) { query := r.URL.Query().Get("q") if query == "" { http.Error(w, "missing query param q", http.StatusBadRequest) return } // plainto_tsquery converts plain text safely (no special chars needed) // websearch_to_tsquery also supports AND/OR/-term syntax sql := ` SELECT id, name, description, ts_rank(search_vec, plainto_tsquery('english', $1)) AS rank FROM products WHERE search_vec @@ plainto_tsquery('english', $1) ORDER BY rank DESC LIMIT 20 ` rows, err := pool.Query(context.Background(), sql, query) if err != nil { http.Error(w, err.Error(), http.StatusInternalServerError) return } defer rows.Close() var results []Product for rows.Next() { var p Product rows.Scan(&p.ID, &p.Name, &p.Description, &p.Rank) results = append(results, p) } w.Header().Set("Content-Type", "application/json") fmt.Fprintf(w, "found %d results\n", len(results)) } } func main() { pool, _ := pgxpool.New(context.Background(), "postgres://...") http.HandleFunc("/search", searchHandler(pool)) log.Fatal(http.ListenAndServe(":8080", nil)) }
@@ operator is Postgres's full-text match operator. It uses the GIN index — query time stays O(log n) regardless of table size.Elasticsearch in Python (Official SDK)
The official Python client is elasticsearch-py. Install:
Shellpip install elasticsearch
Index Setup & Bulk Insert
Pythonfrom elasticsearch import Elasticsearch, helpers import csv es = Elasticsearch( cloud_id="YOUR_CLOUD_ID", api_key="YOUR_API_KEY" ) INDEX = "reviews" # 1. Create index with explicit mapping if es.indices.exists(index=INDEX): es.indices.delete(index=INDEX) es.indices.create(index=INDEX, body={ "mappings": { "properties": { "review": {"type": "text"}, # analysed full-text "sentiment": {"type": "keyword"} # exact: "positive" / "negative" } } }) # 2. Bulk insert from CSV def generate_docs(path: str): with open(path) as f: reader = csv.DictReader(f) for row in reader: if row.get("review") and row.get("sentiment"): yield { "_index": INDEX, "_source": { "review": row["review"], "sentiment": row["sentiment"] } } success, errors = helpers.bulk(es, generate_docs("reviews.csv")) print(f"Inserted {success} documents, {len(errors)} errors")
Search with Fuzzy + Field Boost
Pythondef search_reviews(query: str, sentiment_filter: str = None) -> list: must_clauses = [{ "multi_match": { "query": query.lower(), "fields": ["review"], "fuzziness": "AUTO", "operator": "and" # all terms must appear } }] filter_clauses = [] if sentiment_filter: filter_clauses.append({ "term": {"sentiment": sentiment_filter} }) body = { "query": {"bool": {"must": must_clauses, "filter": filter_clauses}}, "size": 20 } resp = es.search(index=INDEX, body=body) return [ {"score": hit["_score"], **hit["_source"]} for hit in resp["hits"]["hits"] ] # Usage results = search_reviews("gret product", sentiment_filter="positive") # "gret" (typo for "great") matched via fuzziness=AUTO (edit distance 1)
Streaming Results from Two Sources (Node.js Pattern)
The demo in the lecture used a Next.js API route that streams results from Postgres and Elasticsearch simultaneously so neither waits for the other. Here is the pattern in pure Node.js:
Node.js (ESM)import { Client } from '@elastic/elasticsearch' import { neon } from '@neondatabase/serverless' export async function GET(req) { const { searchParams } = new URL(req.url) const term = searchParams.get('q') const encoder = new TextEncoder() const stream = new ReadableStream({ async start(controller) { // Fire BOTH queries concurrently — don't await sequentially const [pgResult, esResult] = await Promise.allSettled([ pgSearch(term), esSearch(term), ]) controller.enqueue(encoder.encode(JSON.stringify({ source: 'postgres', ...pgResult }))) controller.enqueue(encoder.encode(JSON.stringify({ source: 'elasticsearch', ...esResult }))) controller.close() } }) return new Response(stream, { headers: { 'Content-Type': 'text/event-stream' } }) }
The Benchmark — ILIKE vs Elasticsearch on 50k Rows
The demo in the lecture populated 50,000 review documents into both Neon (cloud Postgres) and Elastic Cloud, both hosted in us-west to keep network latency equal. Results:
| Query | Elasticsearch | Postgres ILIKE | Speedup |
|---|---|---|---|
| "laptop" | ~500 ms | ~3 s | 6× |
| "only" (broad) | ~500 ms | ~7.5 s | 15× |
The broader the search term (more matching rows), the worse ILIKE gets — because Postgres must retrieve and return all those rows from disk. Elasticsearch computes relevance scores in-memory on the shard and only returns the top-N, so latency stays low.
When to Use What
Postgres Full-Text Search
- Data already in Postgres
- Search is a secondary feature, not the core product
- Team small / no dedicated infra for ES
- Moderate data volume (< a few million rows)
- Don't need fuzzy/typo search on day one
Elasticsearch
- Search is a first-class feature
- Millions+ of documents
- Need typo tolerance / autocomplete
- Complex relevance tuning needed
- Already running ELK for logs anyway
- Need aggregations / analytics on search results
Decision Flow
References & Further Reading
Backend Field Manual · Full-Text Search Chapter · Notes compiled from video lecture